AD  A077965 


Research Memorandum 78-17


ANALYSIS  OF  VARIANCE:  SELECTION  OF  A 
MODEL  AND  SUMMARY  STATISTICS 


Frederick  H.  Steinheiser,  Jr.,  and  Kenneth  I.  Epstein 


UNIT  TRAINING  &  EVALUATION  SYSTEMS  TECHNICAL  AREA 




U.  S.  Army 

Research  Institute  for  the  Behavioral  and  Social  Sciences 


August  1978 


DISTRIBUTION STATEMENT A

Approved for public release;
Distribution Unlimited




Project Number
2Q762722A764




Research Memorandum 78-17




Submitted  by: 

Frank  J.  Harris,  Chief 

UNIT  TRAINING  AND  EVALUATION  SYSTEMS  TECHNICAL  AREA 




Approved by:


A.  H.  Birnbaum 
Acting  Director 

Organizations  and  Systems  Research 
Laboratory 


Joseph  Zeidner,  Technical  Director 
(Designate) 

U.S.  Army  Research  Institute  for 
the  Behavioral  and  Social  Sciences 


Research Memorandums are informal reports on technical research
problems. Limited distribution is made, primarily to personnel engaged
in research for the Army Research Institute.




ANALYSIS  OF  VARIANCE:  SELECTION  OF  A  MODEL 
AND  SUMMARY  STATISTICS 



The topics of this paper are models for the analysis of variance
(ANOVA) (fixed, random, or mixed models) and the subsequent summary
statistics (F ratio, quasi-F ratio, and magnitude of treatment effect)
that may be computed following the ANOVA. ANOVA is a useful method
for assessing the statistical significance of treatment effects. But
the significance of an effect is a function of two decisions. The
first decision is the selection of a model and an appropriate sampling
plan for elements within each of the treatment factors. The second
decision is the choice of summary statistics that indicate the extent of
significance achieved. In this paper, comparisons will be made between
models and between summary statistics. Specific issues will be
clarified concerning the interpretation of results when various models
and summary statistics are used on the same set of data.




Selection  of  an  ANOVA  Model 




In the fixed-effects model, the levels of the independent variables
are assumed to have been exhaustively sampled. No generalization beyond
those levels sampled is intended or theoretically permissible. The
random-effects model assumes that the levels of the treatment variables
have been randomly sampled from a very large population of such levels.
Generalization of results from the random sample to the population is
allowed. The mixed model allows both fixed and random factors to be
studied in the same experiment, and the results for each factor are to
be interpreted according to that factor's sampling plan.


The choice of a model has an impact on the probability of obtaining
the observations under the null hypothesis for each treatment
(factor). Behavioral research is particularly vulnerable to the choice
of a model because often the investigator can use only a limited sample
of the possible number of stimuli (items, drug doses, etc.).
Furthermore, the same stimulus set may, by necessity, be given to all
subjects because of the difficulty in creating comparable sets of
stimuli.

As a simple hypothetical experiment (adapted from Clark, 1973),
suppose that two classes of stimuli, nouns and verbs, are individually
shown to subjects. The purpose is to see if the subject takes the
same amount of time to identify each word as a member of the correct
part-of-speech class. This simple hypothesis will be shown to have
interesting implications for both experimental design and statistical
analysis.


Presented at the 23rd Conference on the Design of Experiments in Army
Research, Development, and Testing; Naval Postgraduate School, Monterey,
Calif.: 19 October 1977.


First, fixed sets of nouns and verbs that are matched on relevant
parameters, such as number of letters and frequency of occurrence, are
prepared. To generalize to the full domain of nouns and verbs, each
subject should receive a different random sample of words from the two
lists. It is impossible, however, to match the words on all relevant
variables. It is also practically impossible to use a different random
sample of words for each subject.

Consider the experimental design shown in Table 1, in which "s"
subjects each are presented "w" different nouns and verbs. To compare
the adequacy of the several possible F ratios for testing the
difference in response time to the two "treatment" (part of speech)
conditions, Table 2 and Table 3, which show expected mean squares (EMS),
will be helpful.


Table 1

Assignment of Subjects and Parts of Speech


                          Part of speech

Subject       P_1 (nouns)             P_2 (verbs)

              w_1 ... w_{w/2}         w_{w/2+1} ... w_w

S_1
 .
 .
 .
S_s

If the significance of the Parts of Speech treatment is tested,
the appropriate F ratio for the model illustrated in Table 3 (parts of
speech and words fixed) is $F_1 = MS_p / MS_{p \times s}$. The only term
in the numerator that is not in the denominator is $sw\sigma^2_p$.
However, if this same F ratio is used with the model in Table 2
(applicable when generalization is desired to all nouns and verbs),
then this F ratio will contain two terms that are not in the
denominator: $s\sigma^2_{w(p)}$ and $sw\sigma^2_p$. Using alternative
error terms in the parts-of-speech fixed, words-random model (Table 2)
also leads to the same problem. For example, if we test the
parts-of-speech effect against the words within parts-of-speech effect,
we obtain $F_2 = MS_p / MS_{w(p)}$. In this case, $EMS_p$ exceeds
$EMS_{w(p)}$ by the amount of $sw\sigma^2_p + w\sigma^2_{p \times s}$.


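Table 2

Expected Mean Squares (EMS), Assuming Parts of Speech Is a Fixed Factor
and Subjects and Words Are Random


Source                                EMS

P (Part of speech)                    $\sigma^2_e + sw\sigma^2_p + s\sigma^2_{w(p)} + w\sigma^2_{p \times s} + \sigma^2_{s \times w(p)}$

W(P) (Words within part of speech)    $\sigma^2_e + s\sigma^2_{w(p)} + \sigma^2_{s \times w(p)}$

S (Subjects)                          $\sigma^2_e + pw\sigma^2_s + \sigma^2_{s \times w(p)}$

P x S                                 $\sigma^2_e + w\sigma^2_{p \times s} + \sigma^2_{s \times w(p)}$

S x W(P)                              $\sigma^2_e + \sigma^2_{s \times w(p)}$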


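Table 3

Expected Mean Squares (EMS), Assuming Parts of Speech and Words Are Fixed
and Subjects Are Random


Source        EMS

P             $\sigma^2_e + sw\sigma^2_p + w\sigma^2_{p \times s}$

W(P)          $\sigma^2_e + s\sigma^2_{w(p)} + \sigma^2_{s \times w(p)}$

S             $\sigma^2_e + pw\sigma^2_s$

P x S         $\sigma^2_e + w\sigma^2_{p \times s}$

S x W(P)      $\sigma^2_e + \sigma^2_{s \times w(p)}$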


Therefore, this F ratio would also be significant when
the true contribution of $\sigma^2_p$ due to parts of speech
(treatments) is really zero. In summary, both $F_1$ and $F_2$ could be
significant when $\sigma^2_p = 0$, provided that $\sigma^2_{w(p)}$ and
$\sigma^2_{p \times s}$ exceed zero.
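
The comparisons above can be checked by subtracting the expected mean
squares term by term. The following is a worked restatement, assuming
the EMS entries as listed in Tables 2 and 3, with $\sigma^2_e$ denoting
the error variance:

```latex
\[
\begin{aligned}
\text{Words random (Table 2):}\quad
  & E(MS_p) - E(MS_{p \times s}) = s\sigma^2_{w(p)} + sw\sigma^2_p \\
  & E(MS_p) - E(MS_{w(p)}) = w\sigma^2_{p \times s} + sw\sigma^2_p \\
\text{Words fixed (Table 3):}\quad
  & E(MS_p) - E(MS_{p \times s}) = sw\sigma^2_p
\end{aligned}
\]
```

Only in the last case does the numerator exceed the denominator by the
treatment term alone, so only there is the ratio a proper test of
$\sigma^2_p = 0$.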


A possible solution to this dilemma is to take the "quasi-F" ratio,
or F', which equals
$(MS_p + MS_{s \times w(p)}) / (MS_{p \times s} + MS_{w(p)})$. Now the
only term in the numerator that is not in the denominator is
$sw\sigma^2_p$. However, F' is distributed only approximately as F,
although the error involved is not large provided that adjustments are
made to the degrees of freedom.

A more conservative solution is minimum F', which assumes that
$MS_{s \times w(p)}$ is zero. A detailed discussion of this problem may
be found in Clark (1973).
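
As an illustration of these ratios, the following minimal sketch
computes $F_1$, $F_2$, F', and min F' from four mean squares. The
function name, the mean squares, and the degrees of freedom are
hypothetical (they assume 2 parts of speech, 5 words per class, and 10
subjects); the degrees-of-freedom adjustment is the usual Satterthwaite
approximation for a sum of mean squares.

```python
def quasi_f(ms_p, df_p, ms_sxw, df_sxw, ms_pxs, df_pxs, ms_w, df_w):
    """Return F1, F2, F', and min F' with approximate degrees of freedom.

    ms_p   : mean square for parts of speech (treatment)
    ms_sxw : mean square for subjects x words within parts of speech
    ms_pxs : mean square for parts of speech x subjects
    ms_w   : mean square for words within parts of speech
    """
    numerator = ms_p + ms_sxw
    denominator = ms_pxs + ms_w

    # Satterthwaite approximation: (sum of MS)^2 / sum(MS^2 / df)
    df_num = numerator ** 2 / (ms_p ** 2 / df_p + ms_sxw ** 2 / df_sxw)
    df_den = denominator ** 2 / (ms_pxs ** 2 / df_pxs + ms_w ** 2 / df_w)

    return {
        "f1": ms_p / ms_pxs,
        "f2": ms_p / ms_w,
        "f_prime": numerator / denominator,
        "f_prime_df": (df_num, df_den),
        # min F' drops MS_sxw(p) from the numerator, so its numerator
        # df is simply df_p; the denominator df is unchanged.
        "min_f_prime": ms_p / denominator,
        "min_f_prime_df": (df_p, df_den),
    }


# Hypothetical mean squares, for illustration only.
result = quasi_f(ms_p=120.0, df_p=1,
                 ms_sxw=10.0, df_sxw=72,
                 ms_pxs=20.0, df_pxs=9,
                 ms_w=30.0, df_w=8)

for name, value in result.items():
    print(name, value)
```

With these made-up inputs, $F_1 = 6$ and $F_2 = 4$ both look large,
while F' and min F' are much smaller. Note also that min F' can be
obtained from the two ordinary ratios alone, as
$F_1 F_2 / (F_1 + F_2)$, which is convenient when only published F
ratios are available.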

A series of Monte Carlo computer simulations (Forster & Dickinson,
1976) explored the relationship between all of the above F ratios and
the resulting Type I error rates. Generally, $F_1$ and $F_2$ alone
produced unacceptably high error rates, whereas F' and min F' were more
conservative, as shown in Table 4.

As shown in Table 5, increasing the number of items and subjects
tends to decrease $F_1$ Type I error for the fixed effects model where
only subjects are random. Min F' and F' continue to have lower error
rates.
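
The inflation of the $F_1$ error rate can be illustrated with a small
simulation. This is a minimal sketch in the spirit of such studies, not
a reproduction of the Forster and Dickinson procedure. It assumes two
word classes, normally distributed word and error effects, no true
parts-of-speech effect, and ten subjects; all names and parameter
values are illustrative. With two treatment levels, $F_1$ equals the
squared paired t statistic on the subjects' class means, which is how
it is computed here.

```python
import random

# Upper 5% point of F(1, 9), i.e., the square of t(.975, 9) = 2.262.
F_CRIT_1_9 = 5.117


def f1_for_one_experiment(rng, n_subjects, words_per_class, sd_words, sd_error):
    """Simulate one experiment under the null hypothesis and return F1."""
    # Word effects are drawn once and shared by all subjects, because
    # every subject sees the same word lists.
    word_effects = [[rng.gauss(0.0, sd_words) for _ in range(words_per_class)]
                    for _ in range(2)]
    diffs = []
    for _ in range(n_subjects):
        class_means = []
        for effects in word_effects:
            scores = [w + rng.gauss(0.0, sd_error) for w in effects]
            class_means.append(sum(scores) / len(scores))
        diffs.append(class_means[0] - class_means[1])

    mean_d = sum(diffs) / n_subjects
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n_subjects - 1)
    # Squared paired t statistic = MS_p / MS_pxs when p = 2.
    return n_subjects * mean_d ** 2 / var_d


def type_i_rate(rng, sd_words, n_sims=500):
    """Proportion of null experiments in which F1 exceeds the .05 cutoff."""
    rejections = 0
    for _ in range(n_sims):
        f1 = f1_for_one_experiment(rng, n_subjects=10, words_per_class=5,
                                   sd_words=sd_words, sd_error=10.0)
        if f1 > F_CRIT_1_9:
            rejections += 1
    return rejections / n_sims


rng = random.Random(1)
rate_no_word_variance = type_i_rate(rng, sd_words=0.0)
rate_word_variance = type_i_rate(rng, sd_words=20.0)
print("F1 rejection rate, no word variance:", rate_no_word_variance)
print("F1 rejection rate, word S.D. = 20:  ", rate_word_variance)
```

When the words do not differ, the rejection rate should sit near the
nominal .05; once word variance is introduced, the same test rejects a
true null hypothesis far more often, which is the pattern the error
rates for $F_1$ in Table 4 describe.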


The  "Magnitude  of  Effect"  as  a  Summary  Statistic 

The F ratio indicates the level of statistical significance that
can be attributed to a particular treatment. The degree of statistical
significance is a joint function of the "true" strength of that factor,
the error variability, which reflects the degree of experimental
control, and the sample size (i.e., number of subjects tested). As
sample size increases, there is increasing power to reject a false null
hypothesis. Thus, in conducting large-scale experiments with hundreds
of subjects, the large "n" may be necessary to detect a weak "signal"
buried in a background of "noisy" data. But the large n may also lead
to spuriously significant F ratios that are actually statistical
artifacts.

One index for assessing the significance of effects is the
"magnitude of effect," also referred to as the "proportion of variance
accounted for." It is interesting to note that relatively few research
papers have included this index as compared to the occurrence of the
ubiquitous F ratio. Basically, the magnitude of effect (m.e.) measures
the degree of association between the independent variable(s) and the

Table 4

Type I Error Rates as a Function of Variation
in MS_{sxp} and MS_{w(p)}


Source of variance
manipulated          S.D._1   S.D._2    F_1     F_2     min F'    F'

Neither                 0        0      .044    .046    .010     .026

MS_{w(p)}               5        0      .228    .052    .038     .044
                       10        0      .484    .070    .060     .060
                       15        0      .586    .056    .048     .052
                       20        0      .724    .050    .048     .048

MS_{sxp}                0        5      .042    .146    .024     .036
                        0       10      .064    .388    .042     .042
                        0       15      .036    .520    .032     .034
                        0       20      .042    .588    .038     .042

Both                    5        5      .124    .096    .034     .042
                       10       10      .190    .090    .040     .040
                       15       15      .220    .138    .056     .064
                       20       20      .208    .118    .048     .048

Note: 500 observations per situation, alpha = .05, p = 2, q = 5, r = 9.

Table 5

Type I Error Rates as a Function of the Numbers of
Subjects and Items


Subjects    Items     F_1     F_2     min F'    F'

   10          5      .240    .070    .040     .040
   10         20      .090    .290    .053     .053
   20          5      .307    .077    .067      --
   20         20      .193    .217    .060     .060

Note: 300 observations per situation, S.D._1 = S.D._2 = 20, and
alpha = .05.


dependent variable(s). In the simplest case for ANOVA having fixed
factors, none of which are repeated, the m.e. formula is

$$\text{magnitude of effect} = \frac{SS_{\text{effect}} - df_{\text{effect}} \times MS_{\text{error}}}{SS_{\text{total}} + MS_{\text{error}}}.$$

Rules for deriving m.e. indexes are provided by Dodd and Schultz (1973),
along with tables for representative ANOVA designs.

The present paper is concerned with the interpretation of these
summary statistics, because both F and m.e. can be computed from the
same set of data. It is clear that, for a given sample size, as the
statistical significance for a given effect increases (that is, as
p(observation | null) decreases), the magnitude for that effect also
increases. But it is also possible that an F ratio may be highly
statistically significant, yet the m.e. for that effect could account
for only a very small proportion of the overall variance. The results
from an experiment summarized in the following section show that when
statistical significance (p < .001) was achieved by several treatments,
the m.e. for these treatments ranged from 1% to 23%.
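
The contrast between the two statistics can be sketched numerically.
The example below applies the m.e. formula for the simplest
fixed-factor case to a made-up one-way design with 2,000 observations;
the function name and all sums of squares are hypothetical and are not
taken from the marksmanship data.

```python
def magnitude_of_effect(ss_effect, df_effect, ms_error, ss_total):
    """Proportion of variance accounted for, simplest fixed-factor case.

    m.e. = (SS_effect - df_effect * MS_error) / (SS_total + MS_error)
    """
    return (ss_effect - df_effect * ms_error) / (ss_total + ms_error)


# Hypothetical one-way design: two groups, 2,000 observations in all.
ss_effect, df_effect = 30.0, 1
ss_error, df_error = 1998.0, 1998
ss_total = ss_effect + ss_error

ms_effect = ss_effect / df_effect
ms_error = ss_error / df_error

f_ratio = ms_effect / ms_error
me = magnitude_of_effect(ss_effect, df_effect, ms_error, ss_total)

print("F ratio:", f_ratio)
print("magnitude of effect:", round(me, 4))
```

Here the F ratio of 30 on 1 and 1,998 degrees of freedom would be
judged highly significant, yet the effect accounts for only about 1.4%
of the total variance.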


A Study of Marksmanship

An experiment was conducted for the U.S. Army Military Police
School at Fort McClellan, Ala., in which 237 students each shot a total
of 240 handgun rounds from eight different position-distance
combinations. There were three repetitions of 80 shots each, at
stationary silhouette targets. Within each repetition, 5 shots were
taken, the weapon was reloaded, and 5 more shots were fired in the
adjacent test lane. (Each subject had previously passed a training
course with a score of at least 35 hits out of 50 shots.) In the test,
160 trials (two repetitions) were taken on Thursdays and the third
repetition was taken on Fridays. The completely crossed design was
therefore A x B x C x D, or 237 x 2 x 8 x 3, or subjects x lanes x
tables x repetitions.

Table 6 highlights the results of the ANOVA from this experiment.
The first column of F ratios assumes a mixed model, with B, C, D as
fixed factors. The second column of F ratios assumes that only the
Tables factor was a fixed factor. The third F ratio column assumes
that all four factors were randomly sampled from their respective
populations. The point is rather obvious: Different ANOVA models
produce different F ratios for null hypothesis rejection, given the
same set of data.
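
To make the dependence on the model concrete, here is a minimal sketch
for a simpler two-factor crossed design (A x B with replication), not
the four-factor marksmanship design. The mean squares are hypothetical,
and the error-term rules are the standard textbook ones for the
all-fixed and all-random two-way models.

```python
def f_ratio_for_a(ms, model):
    """F ratio for factor A in a two-way crossed design with replication.

    ms    : dict of mean squares with keys "A", "B", "AxB", "within"
    model : "fixed"  -> both factors fixed, A tested against within-cell error
            "random" -> both factors random, A tested against the A x B
                        interaction
    """
    if model == "fixed":
        return ms["A"] / ms["within"]
    if model == "random":
        return ms["A"] / ms["AxB"]
    raise ValueError("model must be 'fixed' or 'random'")


# Hypothetical mean squares from one and the same data set.
mean_squares = {"A": 12.0, "B": 8.0, "AxB": 4.0, "within": 1.0}

f_fixed = f_ratio_for_a(mean_squares, "fixed")
f_random = f_ratio_for_a(mean_squares, "random")

print("F for A, fixed model: ", f_fixed)
print("F for A, random model:", f_random)
```

The same mean square for A yields an F ratio four times larger under
the fixed model than under the random model, simply because a different
denominator is judged appropriate.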

The problem of interpreting the F ratios needs to be addressed.
Is there, for example, a significant effect due to Lanes or to
Repetitions? If these effects are assumed to be fixed, the answer is
yes; if they are assumed to be random, the answer for Lanes is no; and
for Repetitions the level of statistical significance has decreased
greatly.




Table 6

Changes in F Ratios as a Function of ANOVA Model


Source             df^a      MS         F^b           F^c           F^d

A (Subjects)       236      12.80      3.93****      2.54****
B (Lanes)            1       7.70      7.33****      5.96**        2.26
C (Tables)           7     732.71    385.64****     79.11****     79.11***
D (Repetitions)      2      34.75     14.18****     12.55****      4.71**

****p < .001.
***p < .01.
**p < .025.

^a df for F ratios were obtained using the Satterthwaite approximation.
^b A random; B, C, D fixed effects.
^c A, B, D random; C fixed effects.
^d A, B, C, D all random effects.


We offer the suggestion that the choice of the ANOVA model, and
ultimately the level of significance reached, lies in the eyes of the
beholder, the scientist. From a sponsor's perspective, only those
conditions studied in the experiment may be of interest. If many lanes,
repetitions, or even tables are never to be studied or added to the
sponsor's testing program, then those factors would never be sampled
from a larger population of such factors. One might argue from a
scientific point of view, however, that many additional lanes,
repetitions, and firing positions could have been tested; that is, we
happen to have chosen only three repetitions, two lanes per subject,
and eight different distance-position combinations. Thus, the
sponsor-practitioner wants information that is specific to his or her
particular test. In contrast, the scientific "purist" may perceive this
one test or experiment as merely one of many different kinds that could
have been conducted for the sponsor. Hence, the choice of model indeed
influences the significance levels obtained.




The power of the F ratio to reject a false null hypothesis is a
function of (a) the "true" strength of the particular factor and
(b) the sample size. Although a large sample size may help to detect
a weak signal in a noisy background, using such a large sample can lead
to increasingly significant F ratios with little, if any, concomitant
increase in the m.e. It is to this latter summary statistic that we
now turn our attention, in the analysis of the same set of marksmanship
data.


The  m.e.  results  are  shown  in  Table  7.  This  table  shows  that  the 
largest  effect  other  than  random  error  was  due  to  the  Tables  factor, 
which  captured  a  23%  share  of  the  total  score  variability.  The  effect 
due  to  the  Subject  factor,  which  reflected  individual  differences  among 
the  students,  reached  nearly  10%.  Several  interaction  terms,  in  which 
Tables  was  a  factor,  accounted  for  about  6%  to  7%. 


Table 7

Changes in Magnitude of Effect Index as a
Function of ANOVA Model


                            Proportion of total variance

                    A random,       A,B,D random,
Source              B,C,D fixed     C fixed           A,B,C,D random

A (Subjects)          .0852           .1027             .1030
B (Lanes)             .0004           .0006             .0005
C (Tables)            .1643           .2454             .2631
D (Repetitions)       .0027           .0041             .0042


Note that the effect due to Repetitions in Table 6 was statistically
significant, whereas Repetitions contributed an effect worth only about
.4% in Table 7. This apparent discrepancy between the two summary
statistics is due to the large number of subjects, which in turn
produced a large number of degrees of freedom. This allows small
F ratios to achieve statistical significance more readily. Thus, the
values for m.e. in Table 7 act as a check upon the significance levels
listed in Table 6. Therefore, the effect due to Repetitions reveals a
slight, but probably inconsequential, learning effect. A similar line
of reasoning holds for the interpretation of the Lanes variable in
Tables 6 and 7.




Summary and Conclusions


In actual experimental testing situations, it may not be easy to
determine whether a given treatment should be classified as a fixed or
as a random effect. For example, in the experiment outlined, the Lanes,
Repetitions, and Tables factors could be considered as either fixed or
random. The Tables factor had eight levels, representing the eight
specific position-distance combinations that comprise the marksmanship
test. Because there are theoretically an infinite number of
distance-position combinations, Tables could be interpreted as a
sampling of eight from this much larger population. A random effects
assignment to Tables could easily be justified because an experimenter
is often interested in generalizing results beyond the specific
treatment levels to a larger set of "real-world" circumstances.
Furthermore, the probability of falsely rejecting a true null
hypothesis is less when a treatment is considered to be random.

In  sum,  the  wise  use  of  an  ANOVA  model  involves  (1)  determination 
of  fixed  versus  random  factors,  (2)  computation  of  complete  sets  of 
summary  statistics,  and  (3)  interpretation  of  the  statistics. 


REFERENCES 


Clark, H. H. The Language-as-Fixed-Effect Fallacy: A Critique of
Language Statistics in Psychological Research. Journal of Verbal
Learning and Verbal Behavior, 1973, 12, 335-359.

Dodd, D. H., & Schultz, R. F. Computational Procedures for Estimating
Magnitude of Effect for Some Analysis of Variance Designs.
Psychological Bulletin, 1973, 79, 392-395.

Forster, K. I., & Dickinson, R. G. More on the Language-as-Fixed-Effect
Fallacy: Monte Carlo Estimates of Error Rates for F1, F2, F' and
min F'. Journal of Verbal Learning and Verbal Behavior, 1976, 15,
135-142.






